In this report,In this report, I had generated LULC Map of Sarlahi District using LandSat 8 image and training data obtained from secondary source volunteered geographic information(VGI) in R. We have used R, along with various packages, for geospatial analysis and machine learning to achieve the objectives. I would like to thank Prof. Arun Pratihasta who inspired and guided through the entire project.
Key Findings:
Data Preparation: I loaded in-situ data and have several proprocessing tasks such as cleaning, removing unusable column, and merging columns as required to represent land use and crop types.
Satellite Data: I had downloaded Landsat 8 image from USGS Earth Explorer and processed in R.
NDVI Analysis: I had calculated the NDVI values from the satellite data.
Machine Learning: I had implemented a rpart classifier model to predict land cover classes .The model was trained and tested using in-situ data in the ratio of 70%:30%.
Model Evaluation: We evaluated the model’s performance using a confusion matrix and calculated accuracy, precision, recall, and F1-score for each land cover class. We also determined producer and user accuracy values.
Recommendations:
Based on the project findings, the following recommendations can be made:
The more insitu data should be collect with more accuratly for refining the predictive model.
Multiple indices should be used with addition of NDVI to have better predictions.
Consider conducting field surveys and data collection to further validate the model’s predictions and enhance its reliability.
Extend the analysis to a larger geographical area to assess land use and land cover patterns more widely
Explore with using more machine learning techniques to compare the results to have better reliability.
Land use and land cover mapping using Landsat 8 images in R is a vital technique in environmental science, urban planning, and resource management. Landsat 8, a satellite equipped with multispectral sensors, captures data in various spectral bands, allowing for the classification of land features. In R, researchers can harness powerful geospatial packages like ‘raster’ and ‘rgdal’ to process and analyze Landsat 8 imagery. By employing spectral indices, such as NDVI, and machine learning algorithms like random forests, these tools enable accurate classification of land cover types, including forests, agriculture, urban areas, and water bodies. The resulting maps offer valuable insights for sustainable land management and conservation efforts.
Context
This report focuses on a specific case study involving the mapping of land use land cover types using available satellite remote sensing data and VGI data in the context of Madesh Province of Nepal specially Sarlahi District. The in-situ data has been collected after the end of Mansoon in the month of October to Janurary. The sample size is approx 600, and is backed with photographic proof. The study area has been chosen due to its significance in agriculture and the presence of diverse crop types. The regions is known epicenter of for crop production for the country.
The primary objectives of this report are as follows:
To analyze the feasibility and effectiveness of classifying LULC types using satellite remote sensing data and VGI.
To visualize the insitu data over the map
To assess the accuracy and performance of the classification model with confusion matrix, overall accuracy and individual class performance.
Satellite Remote Sensing Data: The first component of our data collection involved obtaining satellite remote sensing data. We acquired multi-spectral Landsat 8 imagery fromvUSGS Earth explorervcovering the study area during the month of December, 2021. The metada of Image is as follows: GROUP = LANDSAT_METADATA_FILE GROUP = PRODUCT_CONTENTS ORIGIN = “Image courtesy of the U.S. Geological Survey” DIGITAL_OBJECT_IDENTIFIER = “https://doi.org/10.5066/P9OGBGM6” LANDSAT_PRODUCT_ID = “LC08_L2SP_141041_20211226_20211230_02_T1” PROCESSING_LEVEL = “L2SP” COLLECTION_NUMBER = 02 COLLECTION_CATEGORY = “T1” OUTPUT_FORMAT = “GEOTIFF” MAP_PROJECTION = “UTM” DATUM = “WGS84” ELLIPSOID = “WGS84” UTM_ZONE = 45 GRID_CELL_SIZE_PANCHROMATIC = 15.00 GRID_CELL_SIZE_REFLECTIVE = 30.00 GRID_CELL_SIZE_THERMAL = 30.00 ORIENTATION = “NORTH_UP” RESAMPLING_OPTION = “CUBIC_CONVOLUTION”
# plot the clipped raster True Color Composite
par(col.axis="white",col.lab="white",tck=0)
plotRGB(image, r = 4, g = 3, b = 2, axes = TRUE,
stretch = "lin", main = "True Color Composite")
box(col="white")
# plot the clipped raster False Color Composite
par(col.axis="white",col.lab="white",tck=0)
plotRGB(image, r = 5, g = 4, b = 3, axes = TRUE, stretch = "lin", main = "False Color Composite")
box(col="white")
Volunteered Geographic Information (VGI): In parallel with satellite data, we had availed in-situ VGI data. The in-situ data used was collected after the end of the monsoon from October 23, 2021, to January 6, 2021. Participants provided valuable ground-level information about crop types in the study are with photograph and geographical coordinates associated with each observation.
In this section, we will describe the methods and techniques used to process and analyze the collected remote sensing and in-situ data. Our primary goal is to classify land use and crop types in the study area by integrating satellite imagery with in-situ data.
Libraries:
R Language: We conducted our data analysis and geospatial processing using the R programming language. R is a powerful tool for statistical analysis and geospatial data handling, making it well-suited for our research objectives.
R Packages: Several R packages were essential for our analysis.
sf, terra and raster: These packages allowed us to work with geospatial data, including shapefiles and raster imagery. We calculated the Normalized Difference Vegetation Index (NDVI) using these packages to assess vegetation health.
dplyr: This package was used for data manipulation, allowing us to filter, mutate, and aggregate datasets effectively.
viridis: We utilized the viridis package for defining custom color palettes in our visualizations, making it easier to distinguish between land cover classes.
randomForest: We employed the random forest algorithm for supervised classification tasks.
knitr,tidyr,ggplot2: These packages were used for data visualization, including the creation of plots, table and heatmaps.
leaflet,leaflet.extras: For the interactive mapping component of our analysis, we utilized the leaflet package.
The data was clean, check for unusable columns and removed it. The new column LU_CT was created to store crop types and other build up area merges as Mixed_area.
Data Projection and Alignment: To ensure data compatibility, the satellite remote sensing data and VGI were projected and aligned to a common coordinate reference system (CRS, EPSG:32645).
NDVI Calculation: The NDVI was computed using the red and near-infrared bands of the satellite imagery. The NDVI calculation was performed using the formula (NIR - Red) / (NIR + Red), where NIR represents the near-infrared reflectance values and Red represents the red band reflectance values. This index is very helpful for distinguishing vegetation health and land use. ### NDVI Visualitzation
#We calculate NDVI using the following equation (NIR - Red)/(NIR + Red). The [[ ]] notation specifies the bands, in this case band 5 for NIR and band 4 for Red, within the multi-band raster.
ndvi <- (image[[5]] - image[[4]])/(image[[5]] + image[[4]])
as(ndvi, "SpatialPixelsDataFrame") %>%
as.data.frame() %>%
ggplot(data = .) +
geom_tile(aes(x = x, y = y, fill = layer)) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_blank()) +
labs(title = "NDVI for Sarlahi",
x = " ",
y = " ") +
scale_fill_gradient(high = "#CEE50E",
low = "#087F28",
name = "NDVI")
# Read training points
training_shp <- st_read("shp_files/LULC_Crop-Types_In-SituData2021_USP.shp")
## Reading layer `LULC_Crop-Types_In-SituData2021_USP' from data source
## `D:\R_Project\shp_files\LULC_Crop-Types_In-SituData2021_USP.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 603 features and 14 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 84.96197 ymin: 26.44534 xmax: 86.94405 ymax: 27.19818
## Geodetic CRS: WGS 84
training_shp <-st_transform(training_shp, crs(img))
LULC_aoi <- st_intersection(training_shp, aoi)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
# Use mutate to create the "id" column based on the mapping
class_names <- c("Agriculture"= 1, "Commercial"= 2, "CULARCH"= 3, "Forest"= 4, "PublicUse"= 5, "Residential"= 6, "WaterBodies"= 7)
LULC_aoi <- LULC_aoi %>%
mutate(id = class_names[data_LULC])
write_sf(LULC_aoi, "shp_files/Sarlahi_TP1.shp")
training <- st_read("shp_files/Sarlahi_TP1.shp", quiet = TRUE)
#Plot the training points
ggplot() +
geom_sf(data = training, aes(color = data_LULC ), size = 0.5) +
scale_color_manual(values = c('cyan', 'burlywood', 'darkgreen', 'blue','grey','red','yellow')) +
labs(title = "Classification points by land use") +
theme(panel.background = element_blank(), axis.ticks = element_blank(), axis.text = element_blank())
Data Splitting The dataset was clipped with the
shapefiles of sarlahi district and was further divided into training and
testing sets in a proportion of 70% for training and 30% for
testing.
# Split the AOI into training and testing datasets (70% training, 30% testing)
set.seed(17) # For reproducibility
splitIndex <- createDataPartition(training$data_LULC, p = 0.7,
list = FALSE,
groups = 1)
## Warning in createDataPartition(training$data_LULC, p = 0.7, list = FALSE, :
## Some classes have a single record ( CULARCH ) and these will be selected for
## the sample
training_points <- training[splitIndex, ]
testing_points <- training[-splitIndex, ]
unique(testing_points$data_LULC)
## [1] "Residential" "Agriculture" "WaterBodies" "Commercial" "PublicUse"
Classification:
We use the rpart function in R to create a classification tree model. Here’s an explanation of the code: model.class <- rpart(…): This line of code assigns the result of the rpart function to the object model.class. The model.class will contain the trained classification tree model. as.factor(training_points.data_LULC)~.: This part of the code specifies the formula for the model. It suggests that you are trying to predict the factor variable training_points.data_LULC based on the values of all other variables in the df dataset. In other words, you’re building a classification tree to predict the Land Use and Land Cover (LULC) class based on the other variables in the dataset. data = df: This specifies the dataset df from which the variables for the model are taken. method = ‘class’: The method parameter is set to ‘class’, indicating that you are creating a classification tree, as opposed to a regression tree. minsplit = 7: The minsplit parameter determines the minimum number of data points required to split a node. In this case, a node will only be split if it has at least 7 data points. This parameter helps control the complexity of the tree.
#Extracting spectral values from the raster
training_points <- as(training_points, 'Spatial')
df <- raster::extract(image, training_points) %>%
round()
head(df)
## B1 B2 B3 B4 B5 B6 B7
## [1,] 7787 8285 10081 9788 15422 12635 10376
## [2,] 7847 8678 10726 10827 15928 15576 13115
## [3,] 7618 8488 10458 10482 15383 14720 12356
## [4,] 7517 8239 10259 9932 16454 13655 10548
## [5,] 8154 8909 10740 10770 15761 14491 12386
## [6,] 8352 9007 10757 10863 14701 14523 12236
The primary objective of this study was to classify land use land cover types using the integrated remote sensing and in-situ data.
setwd("D:/R_Project")
#Classifying the imagery
df <- data.frame(training_points$data_LULC, df)
model.class <- rpart(as.factor(training_points.data_LULC)~., data = df, method = 'class', minsplit = 7)
#plot the decision tree.
rpart.plot(model.class, box.palette = 0, main = "Classification Tree")
#Using the model, we predict on the entire image and set the classes to the four land cover categories.
pr <- predict(image, model.class, type ='class', progress = 'text') %>%
ratify()
##
|
| | 0%
|
|================== | 25%
|
|=================================== | 50%
|
|==================================================== | 75%
|
|======================================================================| 100%
##
levels(pr) <- levels(pr)[[1]] %>%
mutate(legend = c("Agriculture", "Commercial","Forest","Residential","Waterbodies"))
levelplot(pr, maxpixels = 1e6,
col.regions = c('cyan', 'burlywood', 'grey', 'darkgreen','red','yellow','blue'),
scales=list(draw=FALSE),
main = " Classified Imagery")
Overall Accuracy: The overall accuracy of the
classification model was approximately 69.23%. Heatmap
Visualization of Confusion Matrix:
To evaluate the performance of our classification model, we created a heatmap of the confusion matrix. This heatmap visually represents the accuracy and misclassification of different land use and crop types.
# Create the heatmap using ggplot2
heatmap_plot <- ggplot(confusion_matrix_df, aes(Actual, Predicted, fill = Frequency)) +
geom_tile() +
geom_text(aes(label = Frequency), vjust = 1, size = 4) +
scale_fill_gradient(low = "pink", high = "red") + # Define color palette
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels
labs(title = "Confusion Matrix Heatmap", x = "Actual", y = "Predicted")
# Display the heatmap
print(heatmap_plot)
Confusion Matrix: The confusion matrix below provides
detailed information on the model’s performance. It displays the number
of correctly and incorrectly classified samples for each class. It shows
how well our model correctly classified different land types and helps
us calculate important performance metrics like accuracy, precision,
recall, and F1-score. This matrix enables us to assess both the model’s
overall performance and its ability to correctly identify specific land
use types.
## [1] "Confusion Matrix:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction Agriculture Commercial PublicUse Residential WaterBodies
## Agriculture 27 0 0 2 3
## Commercial 2 0 1 0 0
## PublicUse 3 1 0 0 0
## Residential 0 0 0 0 0
## WaterBodies 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6923
## 95% CI : (0.5243, 0.8298)
## No Information Rate : 0.8205
## P-Value [Acc > NIR] : 0.9846
##
## Kappa : 0.0449
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Agriculture Class: Commercial Class: PublicUse
## Sensitivity 0.8438 0.00000 0.00000
## Specificity 0.2857 0.92105 0.89474
## Pos Pred Value 0.8438 0.00000 0.00000
## Neg Pred Value 0.2857 0.97222 0.97143
## Prevalence 0.8205 0.02564 0.02564
## Detection Rate 0.6923 0.00000 0.00000
## Detection Prevalence 0.8205 0.07692 0.10256
## Balanced Accuracy 0.5647 0.46053 0.44737
## Class: Residential Class: WaterBodies
## Sensitivity 0.00000 0.00000
## Specificity 1.00000 1.00000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.94872 0.92308
## Prevalence 0.05128 0.07692
## Detection Rate 0.00000 0.00000
## Detection Prevalence 0.00000 0.00000
## Balanced Accuracy 0.50000 0.50000
This project highlights the effectiveness of using Rpart machine learning models for land use and land cover classification. It demonstrates that integrating in-situ data with satellite imagery can significantly enhance classification accuracy. The model exhibited nice results and also identifies areas for future planning. By continuing to refine the model and incorporating more extensive datasets, we can further explore the potential of remote sensing and machine learning for applications in environmental and urban areas. This work lays the foundation for future research and applications in land use and land cover classification.
Volunteered In-Situ Data for Agriculture Crop Mapping: A Case Study of Nepal By U. S. Panday , A. K. Pratihast , J. Aryal , R. Kayastha
Pratihast, A. K., DeVries, B., Avitabile, V., de Bruin, S., Kooistra, L., Tekle, M., & Herold, M. (2014). Combining satellite data and community-based observations for forest monitoring. Forests. https://doi.org/10.3390/f5102464
Pratihast, A. K., Herold, M., Avitabile, V., de Bruin, S., Bartholomeus, H., Souza, C. M., & Ribbe, L. (2013). Mobile devices for community-based REDD+ monitoring: A case study for central Vietnam. Sensors (Switzerland), 13, 21–38. https://doi.org/10.3390/s130100021